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Abstract 

In this paper, we propose two mixed-effects models for Genetic Analysis Workshop 18 (GAW18) longitudinal blood 
pressure data. The first method extends EMMA, an efficient mixed-model association-mapping algorithm. EMMA 
corrects for population structure and genetic relatedness using a kinship similarity matrix. We replace the kinship 
similarity matrix in EMMA with an estimated correlation matrix for modeling the dependence structure of repeated 
measurements. Our second approach is a Bayesian multiple association-mapping algorithm based on a mixed-effects 
model with a built-in variable selection feature. It models multiple single-nucleotide polymorphisms (SNPs) 
simultaneously and allows for SNP-SNP interactions and SNP-environment interactions. We applied these two methods 
to the longitudinal systolic blood pressure (SBP) and diastolic blood pressure (DBP) data from GAW18. The extended 
EMMA method identified a single SNP on Chr5:75506197 (p-value = 4.67 x 1CT 7 ) for SBP and three SNPs on 
Chr3:23715851 (p-value = 9.00 x 10~ 8 ), Chr 1 7:5483421 7 (p-value = 1.98 x 10~ 7 ), and Chr21:1 8744081 (p-value = 4.95 x 
1 0~ 7 ) for DBP. The Bayesian method identified several additional SNPs on Chrl :1 7876090 (Bayes factor [BF] = 1 02), 
Chr3:1 97469358 (BF = 69), Chrl 5:87675666 (BF = 43), and Chrl 9:41 642807 (BF = 33) for SBP. Furthermore, for SBP, we 
found a single SNP on Chr3:1 97469358 (BF = 69) that has a strong interaction with age. We further evaluated the 
performances of the proposed methods by simulations. 



Background 

Genome-wide association studies (GWAS) have been 
used for examining genetic variants associated with 
blood pressure and hypertension [1,2]. Because blood 
pressure changes over time, it is important to collect 
multiple blood pressure measurements to study time- 
dependent genetic effects. Genetic Analysis Workshop 
18 (GAW18) data included systolic blood pressure (SBP) 
and diastolic blood pressure (DBP) measurements from 
a human whole genome sequencing (WGS) study [3]. 
The study was longitudinal, and the majority of partici- 
pants had three measurements collected at approxi- 
mately 5-year intervals. This paper proposes two mixed- 
effects models for GAW18 longitudinal SBP and DBP 
data. The first approach extends the EMMA method [4], 
an efficient mixed-model association-mapping algorithm. 
EMMA corrects for population structure and genetic 



relatedness using a kinship similarity matrix. We replace 
the kinship similarity matrix in EMMA with an esti- 
mated correlation matrix for the dependence structure 
of the multiple measurements from each individual. 
With this extended approach, hundreds of thousands or 
even millions of association tests can be performed effi- 
ciently. However, this approach tests only one single- 
nucleotide polymorphism (SNP) at a time and may have 
low power to map SNPs that interact with each other. 
Furthermore, it is not straightforward to tweak EMMA 
software for testing SNP by time interaction, an impor- 
tant question that can be addressed through longitudinal 
data. To address these concerns, we developed a Baye- 
sian method based on the composite model space fra- 
mework of Yi et al [5-7]. The proposed method fits 
multiple SNPs simultaneously. In addition, it allows for 
SNP-SNP interactions and SNP-time interactions. 
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Methods 

Extended EMMA 

For testing association between a given SNP and the 
longitudinal phenotype, we fit the mixed-effects model 

yi = fn+ xfp e + x\p g + Ui + ei[i = 1, ri) (1) 

where y t = (y il; ...,y;„.) T is the n,- x 1 phenotype vector 
of individual i; m = fil ni with M being the grand mean and 
In, being the m x 1 vector whose elements are all equal to 
1; xl is the design matrix corresponding to nongenetic 
covariates (e.g., time), and /3 e is the associated nongenetic 
effects; xf is the numerically coded genotype of individual i 
and p g is the corresponding SNP effect. In the model, we 
assume random effect M; ~ N(0, &gKi) where fQ is an 
rij x m matrix, and random error a ~ N(0, CT e 2 7 Hj ). The 
SNP effect can be tested as H 0 : P g = 0 versus Hi : p g f 0 
via the likelihood ratio test. For GWAS or WGS data, this 
test needs to be performed with a large number of SNPs, 
which can be computationally intensive if we treat KjS as 
the unknowns and estimate them jointly with the fixed 
effects. EMMA [4] is an efficient algorithm originally 
developed for GWAS data in which samples are poten- 
tially structured. EMMA models the structure effect via a 
similarity matrix. An R package that implements EMMA 
can either estimate the similarity matrix using genotype 
data or take any similarity matrix provided by users. We 
tweak EMMA for our purpose. We provide EMMA with 
the following similarity matrix K = diag[K\,K 2l ...,K„) 
where jfjs are the estimated correlation matrices from 
model (1) in which p g is set to 0. The idea of estimating 
JQs this way is not new and has been used in EMMAX [8], 
a fast version of EMMA. These estimates should be rea- 
sonable unless some SNPs have large effects, which is rare 
for most complex traits. 

Bayesian multiple QTL mapping 

To further identify SNPs interacting with each other and 
with other nongenetic factors, such as time, we consider 
the following mixed-effects model 

Yi = ix t + x\p° + xf p + xfp^ + xf> e + ut + a 
= fii+ Xifi + Ui + = 1, n) 

where X;[= (x°,x^, xf s ,x^ e )] is the design matrix corre- 
sponding to nongenetic factors, p putative SNPs, two- 
way interactions between p SNPs (resulting in total of p 
{p-l)/2 terms) and other selected SNP-environment 
interactions (for GAW18 data, we consider p SNP-age 
interactions); p[= (p eT , p sT , p^ T , p seT ) T ] is the vector of 
all fixed effects. We define Mi the same way as in model 
(1). The random effects m, and e, are also assumed to 
follow the same distributions as described in model (1). 
Model (2) includes the effects of all putative SNPs; thus, 



the number of such effects can be large. To identify 
SNPs associated with the trait of interest, we use a Baye- 
sian variable selection procedure in which we use a set 
of latent binary variables Kfe(fe = 1, cf) to indicate which 
of the c\ genetic effects (be they main genetic effects, 
epistasis effects and/or SNP by environment interac- 
tions) are associated [yk = 1) or not associated [yk = 0) 
with the trait. 

As in model (1), we assume matrix K, is known. We 
apply the Cholesky decomposition to K{ such that 
Ki = MjMj where M; is the n; x n; lower triangular Cho- 
lesky decomposition matrix of K[. Then model (2) can be 
reparameterized as Yi = Mi + x iP + <? g Mibi + where 
bi = (ha, bi ni ) T ~ N(0, l Ui ). We use the same prior dis- 
tributions for M, P, y = {y\, Yq) T , and in Yi et al [7]. 
We set the prior of °g to N + (m g0 ,Sg 0 ), where N + (fi 0 ,crQ) 
is the positive truncated normal density with mean Mo 
and variance a^, and both %o and s g0 are prespecified 
hyperparameters. The proposed method has been imple- 
mented upon the widely used R package, R/qtlbim [9] for 
these GAW18 longitudinal data. 

Results and discussion 

GAW1 8 data 

The GAW18 data included 849 individuals with both 
phenotype and imputed genotype data from 20 extended 
pedigrees. Each sample was measured multiple times on 
their blood pressures over approximately 5-year intervals. 
Among these 849 individuals, 139 were genetically unre- 
lated and were measured for age, sex, medication use, 
smoking status, and blood pressure. Our analysis was 
restricted to the 139 unrelated individuals. The number 
of SBP and DBP ranged from one to four for each sam- 
ple. WGS data provided by the GAW18 data had 
8,348,674 SNPs from odd numbered autosomes. All 
SNPs provided passed the initial quality control checking, 
but among 2,796,608 SNPs with minor allele frequency 
(MAF) greater than 0.05, 17,463 of them failed Hardy- 
Weinberg equilibrium (HWE) test (with j5-value < 0.05/ 
2,796,608, a Bonferroni corrected genome-wide thresh- 
old). We removed all SNPs with MAFs less than 0.05 
plus those not passing the HWE test, resulting in 
2,779,145 SNPs for the subsequent analyses. 

To check population outliers and potential population 
substructure, we generated a subset of SNPs that are not 
in high linkage disequilibrium (LD) with each other (i.e., 
r 2 < 0.5) an d performed the multidimensional scaling 
(MDS) analysis in PLINK [10]. Pairwise scatter plots of 
the top four MDS scores showed that the 139 individuals 
are homogeneous in terms of their ethnicities. However, 
two samples, T2DG0400207 and T2DG0400247, have an 
estimated IBD value of 0.3 between them, indicating that 
they are likely related. In our analysis, we retained all 139 
samples because the number of putatively related 
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samples is small and their inclusion should have a negli- 
gible effect on our analysis results. 

We applied the two proposed procedures to these fil- 
tered GAW18 data on the two log-transformed pheno- 
types, log(SBP) and log(DBP). Five covariates (age, age 2 , 
sex, medication use, and smoking status) were included 
for analyses. We fitted these data with different covariance 
matrices in SAS 9.2 and selected the spatial power covar- 
iance structure for the downstream analysis based on the 
AIC criteria. Specifically, we let cov(yij,yij<) = a 1 p di '»', 
where dig is the time distance between the jth and j'th 
examinations for individual i. After obtaining the para- 
meter estimate of P, p from model (1) with /3 s = 0, we 
substituted the kinship matrix K in EMMA by 
K = diag[Ki,k 2 , ...,£„) where K t = {p d <a'}- Figure 1(a) dis- 
plays the Manhattan plots of the two phenotypes 
from the extended EMMA model. For SBP, one 
SNP on Chr5:75506197 (p = 4.67 x 1CT 7 ) reached the 
genome-wide significance (p — value < 5 x 1CT 7 , a 
cutoff suggested by Burton et al [11]). For DBP, three 
SNPs on Chr3:23715851 (p - value = 9.00 x 1(T 8 ), 
Chrl7:54834217 (p-value = 1.98 x KT 7 ) and Chr21: 
18744081 (p - value = 4.95 x KT 7 ) exceeded the gen- 
ome-wide significance. 

Because of the limited sample size, it is not feasible to 
include all available SNPs in our Bayesian analysis. For 
each phenotype, we selected a list of 3000 top-ranked 



SNPs that are not highly correlated with each other 
(with correlation < 0.95 to avoid multicollinearity) from 
the extended EMMA for the Bayesian analysis. We 
applied this Bayesian method with the same covariates 
used in the extended EMMA method. For all analyses, 
the MCMC algorithm ran for 4 x 10 s iterations after 
the first 1000 burn-in iterations were discarded. The 
chain was then thinned for every 40 iterations, yielding 
10 4 MCMC samples for the posterior analysis. Based on 
the posterior inclusion probability of each SNP, the 
Bayes factor (BF) (see [6,7] for details) was estimated 
and used to judge the importance of each SNP. Figure 1 
(b) shows the Manhattan plots of 2ln[BF) for the com- 
bined genetic effects of each SNP, which include the 
main effects, epistasis effects, and SNP-age interactions. 
We found several additional SNPs with strong signals 
(BF > 30 as suggested by Yandell et al [12]) on 
Chrl:17876090 (BF = 102), Chr3: 197469358 (BF = 69), 
Chrl5:87675666 (BF = 43), and Chrl9:41642807 (BF = 
33) for SBP. No new SNPs were found for DBP. For 
SBP, we found one SNP located on Chr3:197469358 
(BF = 69) has a strong interaction with age. 

Simulations 

To evaluate the performances of the proposed methods, 
we conducted the following simulations. From the 3000 
top-ranked SBP SNPs previously selected, we randomly 
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Figure 1 Manhattan plots on Genetic Analysis Workshop 18 (GAW18) longitudinal data, (a) Manhattan plots of -log 10 (p-value) for systolic 
blood pressure (SBP) and diastolic blood pressure (DBP) from the extended EMMA. The two dashed horizontal lines represent the genome-wide 
thresholds for suggestive (p-value = 1CT 5 and significant (p-value = 5 x KT 7 ) associations, (b) Manhattan plots of 2 in (BF) for the proposed 
Bayesian method. Two dashed horizontal lines represent the genome-wide thresholds for moderate (BF = 10) and strong (BF = 30) associations. 
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picked up 10 of them that are at least 10 Mb apart as 
causal SNPs and called them SNP h ...,SNP w . Among the 
10 causal SNPs, we let 7 of them have only main 
effects, 2 have an epistasis effect, and 1 have an SNP- 
age interaction. The estimated correlation matrix 
diag(ki,K 2 , ...,K n ) along with cr? = 0.8 was used to 
simulate the random effects M;s. We set er 2 to 1. Specifi- 
cally, we simulated data according to the following 

model: y, = [SNP U + ... + SNP 7i + SNP 8l ■ SNP si )l ni + SNP 10 , ■ age, + u, + e, 

where u, ~ N(0,cr 2 fQ) and e, ~ N(0,cr e %> A total of 
100 simulations were performed. We compared the two 
proposed methods with each other and with two other 
existing methods, the original EMMA and R/qtlbim 
methods. The last two methods only work for univariate 
data, so we applied them to the simulated data with 
only first-time measurements used. To make the meth- 
ods comparable, we generated the receiver operating 
characteristic (ROC) curve for each method as described 
later. For a given cutoff of p-value or BF, we calculated 
the true and false positive findings as follows: a signifi- 
cant finding is claimed to be a true positive finding if it 
is located less than 1 Mb from any one of the simulated 
causal SNPs; otherwise the finding is false. The ROC 
curves with the false-positive rate less than 0.2 are pre- 
sented in Figure 2. Intuitively, our two methods that 
used all available data are more powerful than their cor- 
responding univariate analysis methods that only used 
the first-time data. Furthermore, the Bayesian method is 
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Figure 2 Receiver operating characteristic curves on simulated 
data. Solid line represents proposed Bayesian method (i.e., modified 
R/qtlbim) on all data; dashed line, R/qtlbim on first time point data; 
dotted line, extended EMMA on all data; and dot-dashed line, 
EMMA with first time point data. 



more powerful than the extended EMMA as expected 
because (a) the Bayesian model allows for SNP-SNP and 
SNP-age interactions, which are totally ignored by the 
extended EMMA, and (b) the Bayesian model jointly 
model multiple SNPs, but the extended EMMA only 
tests one SNP at a time. 

Conclusions 

In this paper, we developed two mixed-effects models 
for the GAW18 longitudinal blood pressure data. The 
first approach extends the EMMA method. We replace 
the kinship similarity matrix in EMMA with an esti- 
mated correlation matrix for dealing with the dependent 
structure of the repeated measurements. The second 
approach is a Bayesian method that models multiple 
SNPs simultaneously and allows for SNP-SNP interac- 
tions and SNP-time interactions. The advantages of the 
Bayesian method have been clearly demonstrated by our 
simulations. Both methods are currently developed for 
unrelated samples. The GAW18 data contained 
extended pedigrees. Ideally, we should use all available 
data in our analysis. What complicates the analysis on 
longitudinal pedigree data is that both the correlation 
structure of the repeated measurements and the familial 
correlation structure of related individuals should be 
considered. We are currently extending the two pro- 
posed methods for the GAW18 pedigree data. Further- 
more, for both our analyses, we assume that the 
covariance matrix is known up to a constant. For the 
Bayesian model, this assumption can be relaxed and we 
are developing a semiparametric approach where the 
covariance matrix is assumed unknown. We estimate 
the unknown covariance matrix with a modified Cho- 
lesky decomposition [13]. Last, our Bayesian model for 
GWAS data relies on a set of preselected putative SNPs. 
How to select a good set of putative SNPs, especially 
those with low marginal effects but high interactions 
with other SNPs or environmental factors is challenging 
and deserves further investigations. 
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